Cap-sensitive text search for documents

ABSTRACT

A capitalization-sensitive (“Cap-sensitive”) search is a feature that provides the ability to specify as search criteria both a term and a capitalization characteristic or signature of the term. There are conventional approaches that enable a capitalization-sensitive search to be performed from within a document.

RELATED APPLICATIONS

This application claims benefit of priority to Provisional U.S. Patent Application No. 60/821,129, filed Aug. 1, 2006, entitled CAP-SENSITIVE TEXT SEARCH FOR DOCUMENTS; the aforementioned priority application being hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The disclosed embodiments relate generally to the field of text searching and retrieval.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A illustrates a method for searching for documents using capitalization characteristics, under an embodiment of the invention.

FIG. 1B illustrates a method for aggregating or formulating an index for a system that can handle search operations that include criteria that specify one or more capitalization characteristics, under an embodiment of the invention.

FIG. 2 illustrates a system for enabling the searching for documents using capitalization characteristics, under an embodiment of the invention.

FIG. 3A-3C illustrate a schema for classifying identified words and tokens for subsequent searching, based on capitalization characteristics, under an embodiment of the invention.

FIG. 4 illustrates a method for enabling searching of documents through use of search criteria that may include capitalization characteristics, under an embodiment of the invention.

DETAILED DESCRIPTION

Overview

A capitalization-sensitive (“Cap-sensitive”) search is a feature that provides the ability to specify as search criteria both a term and a capitalization characteristic or signature of the term. There are conventional approaches that enable a capitalization-sensitive search to be performed from within a document. For example, MICROSOFT WORD enables an individual to perform a Find function when a document is opened. The Find function is used to find a specific character string in the document. The user can optionally make the search string “cap-sensitive”, to eliminate occurrences of the character string that do not have the specific capitalization characteristic.

Embodiments described herein enable text searching that accommodates a search criterion corresponding to a capitalization characteristic. In one embodiment, one or more search terms are received, and a determination is made as to a capitalization characteristic of at least one search term. One or more documents are identified from a collection of documents. The identification is based at least in part on the determination of the capitalization characteristic of the search term, so that the search result satisfies the criteria of the capitalization characteristic.

In another embodiment, a system is provided for performing a text search. In one embodiment, an index stores a plurality of entries, where each entry in the index corresponds to a text item of a particular document in a larger collection of documents. At least some of the entries include information about a capitalization characteristic of a corresponding text item so as to enable a search operation that is specific to the capitalization criteria of the search operation.

In another embodiment, a search interface is provided in connection with the index to handle a request that includes the capitalization criteria.

Still further, a document retrieval component is configured to access local or network locations to retrieve and scan documents for text items (e.g. words) that correspond to entries that are to populate the index.

Embodiments described herein provide a mechanism by which cap-sensitive searches can be performed in various environments that utilize indexes or similar data structures to aggregate words and search terms from various sources.

Additionally, an embodiment enables the cap-sensitive search to be performed for a text item that occurs in any one of a plurality of source documents.

Still further, an embodiment employs a cap-sensitive search for use in a search engine on a network, such as at a search engine web site.

One or more embodiments described herein may be implemented through the use of modules or software/logic components. A module refers to a program, a subroutine, a portion of a program, a software component, firmware or a hardware component capable of performing a stated task or function. A module can exist on a hardware component such as a server independently of other modules, or a module can exist with other modules on the same server or client terminal, or within the same program. A module may be implemented on a client or on a server, or distributed between clients, servers, or amongst a client-server.

Furthermore, one or more embodiments described herein may be implemented through the use of instructions that are executable by one or more processors. These instructions may be carried on a computer-readable medium. Services and components illustrated by figures in this application provide examples of processing resources and computer-readable mediums on which instructions for implementing embodiments of the invention can be carried and/or executed. In particular, the numerous machines shown with embodiments of the invention include processor(s) and various forms of memory for holding data and instructions. Examples of computer-readable mediums include permanent memory storage devices, such as hard drives on personal computers or servers. Other examples of computer storage mediums include portable storage units, such as CD or DVD units, flash memory (such as carried on many cell phones and PDAs), and magnetic memory. A computer-readable medium as used herein may extend across multiple machines. For example, the medium may be distributed between client and server in order to perform a stated task or operation.

Methodology

FIG. 1A illustrates a method for performing a cap-sensitive search, under one or more embodiments of the invention. In a step 110, a search term is received from a user. For example, the search term may be specified by a user operating a terminal over the Internet. The search term may correspond to a word, phrase, portion of a word, acronym, proper noun, or other character string that includes or does not include capitalization. In providing the search term, the user may enter characters through use of a web browser or interface, or alternatively, through use of a client application.

Step 120 provides that a determination is made as to whether the search term includes a capitalization characteristic. Under one or more embodiments, the capitalization characteristics that may be detected include (i) a word or term that is entirely capitalized (“all caps”), or (ii) a word or term that has one character capitalized, or is partially capitalized. As described with one or more embodiments, the types of capitalization may be classified or grouped. Alternatively, the capitalization of a term may be specific to the position and characters that are capitalized.

If the determination of step 120 is that a capitalization characteristic is present in the search term, then step 130 provides that a selection is made for one or more documents that include (i) a matching or qualifying text item (ii) having a capitalization characteristic that satisfies the capitalization characteristic specified in the search term. Under one embodiment, the capitalization characteristic that satisfies that of the search term may be an exact match. For example, the search term “Bush” may return documents that include “Bush”, but not “BUSH” or “bush”. Alternatively, the capitalization characteristic of the text item that satisfies the criteria of the search may simply match a capitalization class that is specified to be a match. For example, “McDonald” as a search term may be matched to a class of text items that include the same characters with any letter capitalized (e.g. MCDONALD or McDONALD).

If the determination of step 120 is that a capitalization characteristic is not present in the search term, then step 140 provides that a selection is made for one or more documents that include a matching or qualifying text item. In one embodiment, matching or qualifying text items may or may not include a capitalization characteristic. For example, the search term “mit” would return documents containing “MIT” or “Mit” or “mit”, and the search term “MIT” would return only documents containing “MIT”. Alternatively, one embodiment provides that the matching or qualifying text item does not include any capitalization characteristic. Thus, “mit” would return documents containing “mit” but not “MIT”.
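The flow of steps 120 through 140 can be sketched in Python as a direct scan over a small document collection. This is a minimal illustration only, assuming a letters-only tokenizer and the exact-match variant of step 130; the names `has_capitalization` and `search`, the `strict_lowercase` flag, and the sample documents are hypothetical and do not come from the disclosure.

```python
import re


def has_capitalization(term: str) -> bool:
    """Step 120: report whether the search term contains any capital letter."""
    return any(ch.isupper() for ch in term)


def search(term: str, documents: dict[str, str],
           strict_lowercase: bool = False) -> list[str]:
    """Return ids of documents whose words satisfy the term and its capitalization."""
    matches = []
    for doc_id, text in documents.items():
        words = re.findall(r"[A-Za-z]+", text)  # assumed tokenizer: runs of letters
        if has_capitalization(term) or strict_lowercase:
            # Step 130 (exact-match variant), or the strict variant of step 140.
            hit = term in words
        else:
            # Step 140: an uncapitalized term matches any capitalization.
            hit = term in [w.lower() for w in words]
        if hit:
            matches.append(doc_id)
    return matches


docs = {
    "d1": "Bush spoke to reporters",
    "d2": "a bush grows in the garden",
    "d3": "BUSH ANNOUNCES PLAN",
}
print(search("Bush", docs))                         # ['d1']
print(search("bush", docs))                         # ['d1', 'd2', 'd3']
print(search("bush", docs, strict_lowercase=True))  # ['d2']
```

The class-based variant of step 130 (the “McDonald” example) would replace the exact comparison with a comparison of capitalization classes.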

Embodiments described herein include a technique for enabling a capitalization-sensitive search to be performed on text content. Under one embodiment, one or more search terms are received. A determination is made as to a capitalization characteristic of at least a first of the one or more search terms. From a plurality of documents, one or more documents are identified based at least in part on the determination of the capitalization characteristic.

FIG. 1B illustrates a method for aggregating or formulating an index for a system that can handle search operations that include criteria that specify one or more capitalization characteristics, under an embodiment of the invention. A method such as described with FIG. 1B may be used to identify documents that contain text items that satisfy criteria that also specify a capitalization characteristic.

In a step 150, a source for text content (e.g. document) is accessed. The text content is scanned for text items, such as words. Such a process may correspond to tokenizing a stream of text. The resulting text items may correspond to words or even phrases.

In scanning the text content, step 160 provides that a capitalization characteristic of individual text items is determined. In one embodiment, the presence of one or more capital letters is determined from the text.
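As a rough illustration of steps 150 and 160, the snippet below tokenizes a stream of text and records which character positions of each token are capitalized. It is a sketch under simplifying assumptions: tokens are taken to be runs of ASCII letters, and the helper names `tokenize` and `capital_positions` are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Step 150: split a stream of text into word tokens (runs of letters)."""
    return re.findall(r"[A-Za-z]+", text)


def capital_positions(token: str) -> list[int]:
    """Step 160: list the zero-based positions of capital letters in a token."""
    return [i for i, ch in enumerate(token) if ch.isupper()]


for token in tokenize("MIT hired Mr. McDonald"):
    print(token, capital_positions(token))
# MIT [0, 1, 2]
# hired []
# Mr [0]
# McDonald [0, 2]
```

An empty position list means the token carries no capitalization characteristic.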

In step 170, the capitalization characteristic determined from step 160 is classified into one of a plurality of groups. In one embodiment, three classes of capitalization characteristics are used: (i) a class of no capitalization in the text item, in which none of the letters or characters that comprise the text item are capitalized; (ii) a class of partial capitalization in the text item, in which some, but not all, of the characters that comprise the text item are capitalized; and (iii) a class of all capitalization in the text item, in which all the characters in the text item are capitalized. One or more embodiments contemplate additional classes that can be used, such as a class to distinguish when only one capitalized letter is present and it is positioned at the beginning of the word.
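The three-class grouping of step 170 might be expressed as follows. This is only a sketch: the `CapClass` enum, the `classify` function, and the `is_initial_cap` helper (for the optional additional class) are hypothetical names, and only alphabetic characters are counted when deciding the class.

```python
from enum import Enum


class CapClass(Enum):
    NONE = "none"        # class (i): no letters capitalized
    PARTIAL = "partial"  # class (ii): some, but not all, letters capitalized
    ALL = "all"          # class (iii): every letter capitalized


def classify(token: str) -> CapClass:
    """Step 170: place a text item into one of the three classes."""
    letters = [ch for ch in token if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    if upper == 0:
        return CapClass.NONE
    if upper == len(letters):
        return CapClass.ALL
    return CapClass.PARTIAL


def is_initial_cap(token: str) -> bool:
    """Optional extra class: exactly one capital letter, at the start of the word."""
    return token[:1].isupper() and not any(ch.isupper() for ch in token[1:])


for word in ("mit", "Mit", "MIT", "McDonald"):
    print(word, classify(word).value, is_initial_cap(word))
```

Note that under this sketch a one-letter capitalized token such as “A” falls into the all-capitalization class, since every one of its letters is capitalized.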

In step 180, entries are recorded in the index to associate text items to the source (or portions thereof) that contained the text items. The index can then be used to find entries that match a search term, and identify documents or other text content associated with the entries that match the search term. In an embodiment, the entries reflect or record the class of the capitalization characteristic for individual text items. According to one embodiment, the entries reflect or record the class of the capitalization characteristic by having duplicative entries to reflect the capitalization characteristic of a text item that has capitalization. One embodiment provides that (i) text items with the classification of no capitalization have only one entry, reflecting the text item with no capitalization; (ii) text items with the classification of partial capitalization have two entries: one entry reflecting the text item with partial capitalization and one entry reflecting the text item with no capitalization; and (iii) text items with the classification of complete or all capitalization have three entries: one entry reflecting the text item with no capitalization, one entry reflecting the text item with partial capitalization, and one entry reflecting the text item with all capitalization.
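One way to picture the duplicative entries of step 180 is an inverted index keyed by the lowercased text item together with a class label. The sketch below is an assumption about a possible layout, not the disclosed implementation; `ENTRIES_FOR_CLASS`, `cap_class`, `build_index`, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict

# Which index entries a text item receives, per its own capitalization class.
ENTRIES_FOR_CLASS = {
    "none": ("none",),                    # one entry
    "partial": ("none", "partial"),       # two entries
    "all": ("none", "partial", "all"),    # three entries
}


def cap_class(token: str) -> str:
    """Classify a token as 'none', 'partial', or 'all' capitalization."""
    upper = sum(1 for ch in token if ch.isupper())
    if upper == 0:
        return "none"
    return "all" if upper == sum(1 for ch in token if ch.isalpha()) else "partial"


def build_index(documents: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Step 180: record one or more entries per text item, pointing to its source."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[A-Za-z]+", text):
            for entry_class in ENTRIES_FOR_CLASS[cap_class(token)]:
                index[(token.lower(), entry_class)].add(doc_id)
    return index


index = build_index({"d1": "MIT", "d2": "Mit", "d3": "mit"})
for key in sorted(index):
    print(key, sorted(index[key]))
# ('mit', 'all') ['d1']
# ('mit', 'none') ['d1', 'd2', 'd3']
# ('mit', 'partial') ['d1', 'd2']
```

Storing the duplicates at indexing time means a query only ever needs to consult the entries of a single class.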

In this way, when the search term is received, step 190 provides that the class of the capitalization characteristic of the search term is determined. Thus, in an embodiment such as described above, the search term may be classified as having no capitalization, partial capitalization, or all capitalization.

Step 195 provides that the search term is matched to entries of the same capitalization class. Thus, if the search term specifies no capitalization characteristic, only those entries that are of the no capitalization class are used in the comparison operation. But because duplicative entries are used to reflect capitalization, the documents that contain the same term in any form of capitalization (including all capitalization) are returned in the search result. Likewise, if the search term contains an all capitalized term, only entries of the class of all capitalization will be compared against. It follows that the document that is returned will have the search term in the same all capitalization form.
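Steps 190 and 195 then reduce to classifying the search term and reading the entries of that one class. The following sketch assumes an index keyed by (lowercased term, class label); the index contents are written out by hand for documents d1 (“MIT”), d2 (“Mit”), and d3 (“mit”), and the names `INDEX`, `cap_class`, and `lookup` are hypothetical.

```python
# Hand-built index with duplicative entries for d1 = "MIT", d2 = "Mit", d3 = "mit".
INDEX = {
    ("mit", "none"): {"d1", "d2", "d3"},
    ("mit", "partial"): {"d1", "d2"},
    ("mit", "all"): {"d1"},
}


def cap_class(term: str) -> str:
    """Step 190: classify the search term as 'none', 'partial', or 'all'."""
    upper = sum(1 for ch in term if ch.isupper())
    if upper == 0:
        return "none"
    return "all" if upper == sum(1 for ch in term if ch.isalpha()) else "partial"


def lookup(term: str, index: dict[tuple[str, str], set[str]]) -> set[str]:
    """Step 195: compare only against entries of the search term's own class."""
    return index.get((term.lower(), cap_class(term)), set())


print(sorted(lookup("mit", INDEX)))  # ['d1', 'd2', 'd3']
print(sorted(lookup("Mit", INDEX)))  # ['d1', 'd2']
print(sorted(lookup("MIT", INDEX)))  # ['d1']
```

The uncapitalized query reaches every document only because capitalized text items were also recorded under the no-capitalization class.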

A method such as described with FIG. 1B may be used to develop an index containing entries for search operations. The entries of the index may reflect capitalization characteristics, and in turn, enable search operations that have some sensitivity to capitalization. Additional details of such an index, and a system for implementing the index, are provided below.

System Description

FIG. 2 illustrates a system for enabling a cap-sensitive search to be performed for documents and other text sources available on a network, under an embodiment of the invention. In an embodiment of FIG. 2, a system includes a document retrieval component 210, an index 220, and a search interface 230. The document retrieval component 210 accesses different network sites 204 or locations for content. Examples of network sites include web sites on which text content 206 is provided, including blogs and message boards. Text content 206 is identified and retrieved from the different sources that contain such content. A user terminal 202 may communicate with the system via a network connection 203. For example, the search interface 230 may include a component that is in the form of a web page that downloads on the terminal when the user navigates to a particular web site using a web browser.

In an embodiment, a capitalization characteristic determinator (CCD) 222 executes with or in association with the document retrieval component 210. The document retrieval component 210 tokenizes a stream of text that is identified from the text content 212. From the tokenization process, tokens of text are identified. Under one implementation, the text tokens may correspond to words or other discrete character strings. The CCD 222 inspects the text tokens to determine whether any of the tokens have capitalization. For example, ASCII or other text data embedded in the text content 212 may be flagged when determined to be in capitalized form.

The document retrieval component 210, using the output of the CCD 222, stores entries 242 in the index 220. In one embodiment, each entry 242 corresponds to a word or other text token. In addition, information about the capitalization characteristic of the text token is stored in the index 220. According to one or more embodiments, when a text token is identified to contain capitalization, multiple entries 242, 244 are stored in the index 220 for that token. More specifically, one entry 242 corresponds to the word/token with no capitalization characteristic, while at least one other entry 244 reflects use of the token/word as part of a capitalization class (e.g. all-cap, or proper noun, etc.). As described with FIG. 3A-3C, index 220 may be structured to provide a schema or hierarchy defining the treatment of capitalization characteristics in the tokens.

In one embodiment, when capitalization is identified from a token, multiple entries for that token are created. One embodiment provides a first entry for use of the token as a word or other character string with no capitalization characteristics. Another entry provides for use of the token as a word or character string with some information or classification of the capitalization characteristic. For example, the second entry may carry the exact capitalization characteristic as provided by the source text content 212, and a class designation that defines whether the capitalization characteristic is either (i) an all-cap form, (ii) proper noun form (i.e. first character is capitalized) or (iii) some other capitalization. In another embodiment, the classification of the capitalization characteristic is one of either a designation of the all-cap form, or any other form with capitalization. Numerous other variations are also possible.
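The two-entry variation described above, in which the second entry carries the exact capitalization and a class designation, could be modeled roughly as below. The record layout is an assumption made for illustration: `IndexEntry`, its field names, `designation_of`, `make_entries`, and the designation labels are hypothetical, and the proper noun form is interpreted here as a capitalized first character followed only by lowercase characters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    key: str                           # token with no capitalization
    doc_id: str                        # source that contained the token
    exact_form: Optional[str] = None   # capitalization exactly as in the source
    designation: Optional[str] = None  # "all-cap", "proper-noun", or "other-cap"


def designation_of(token: str) -> str:
    """Assign the class designation for a token known to contain capitalization."""
    if token.isupper():
        return "all-cap"
    if token[0].isupper() and token[1:].islower():
        return "proper-noun"
    return "other-cap"


def make_entries(token: str, doc_id: str) -> list[IndexEntry]:
    """Create one entry for an uncapitalized token, two entries otherwise."""
    entries = [IndexEntry(key=token.lower(), doc_id=doc_id)]
    if any(ch.isupper() for ch in token):
        entries.append(IndexEntry(token.lower(), doc_id, token, designation_of(token)))
    return entries


for tok in ("bush", "Bush", "BUSH", "McDonald"):
    print(make_entries(tok, "d1"))
```

Keeping the exact form alongside the designation would allow either an exact-match comparison or a class-level comparison at query time.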

While embodiments such as described by FIG. 2 provide for a network environment, other variations are contemplated in which the network environment may be omitted entirely or in part. For example, the document retrieval component 210, or its equivalent, may operate on a local data source, such as a personal computer. Index 220 may include entries from documents that are both local and on the network sites, or alternatively just local. Furthermore, while an embodiment such as described with FIG. 2 describes the search interface 230 as being a web page that is downloaded on the user terminal 202, the particular location of the search interface 230 or the index 220 may vary between client and server. In a stand-alone variation, the search interface 230 corresponds to an application that executes on the terminal 202 and operates on an index of local documents and content. Numerous other variations are also contemplated.

FIG. 3A-FIG. 3C illustrate the use of one schema for identifying and recording the presence of capitalization characteristics in an identified text token of a document, under one or more embodiments of the invention. In FIG. 3A-3C, an identified token 302 is provided one or more entries, depending on the capitalization characteristic of the token 302. Under one embodiment, each identified entry 312 may have between one and three class designations 304. A separate class designation 304 may be provided for the case where (i) the token has each character in a capitalized form, (ii) the token has at least one, but less than all, characters in a capitalized form, and (iii) the token has no characters in the capitalized form. In other variations, more or fewer class designations may be defined, such as classes that distinguish when only the first letter is capitalized, as opposed to an alternative set of lettering in the token.

In FIG. 3A, the identified token 302 (MIT) has only capitalization in its lettering, matching the all-cap class designation. According to a schema such as shown, the all-cap characteristic results in separate entries 304 for each defined class of capitalization (all cap, partial cap and no cap). Each of the entries 304 is stored in the index 220 (FIG. 2) to point to the document where the token in all-cap form was identified.

FIG. 3B provides a proper noun usage, which under the schema described, corresponds to the partial cap designation. The entries 304 are provided the partial-cap and no-cap designations, but not the all-cap designation. Each of the two entries is stored in the index 220 (FIG. 2), and points to the document where the token in the partial cap form was identified.

In FIG. 3C, a non-cap usage is shown. Only one entry 304 is provided in the index 220 (FIG. 2), and it points to the document where the token in the non-cap form was found.

FIG. 4 illustrates a method for performing a search using a schema such as shown and described with FIG. 3A-3C, under one or more embodiments of the invention. In FIG. 4, a step 410 provides that a user's search term is entered. For example, the search interface 230 (FIG. 2) may receive the search term from the user.

Step 420 provides that the search term is tokenized, similar to how the text content from the various sources is tokenized. As a result, words or phrases are identified from the user's search term.

In step 430, the capitalization characteristic of each token in the search term is determined. Referencing an embodiment of FIG. 3A-FIG. 3C, step 440 provides that one or more matching entries are selected from the index 220 (FIG. 2). In step 450, a search result is provided identifying or including documents that satisfy the search term and the capitalization criteria or condition specified therein. The use of multiple entries to reflect different capitalization characteristics (as described with FIG. 3A-3C) results in the following matches:

Search Term    Result
mit            documents containing “mit” AND documents containing “Mit” AND documents containing “MIT”
Mit            documents containing “MIT” AND documents containing “Mit”
MIT            documents containing only “MIT”
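The steps of FIG. 4 can be sketched for a search term that contains more than one token. The disclosure does not state how the results for several tokens are combined; intersecting the per-token document sets is an assumption made here purely for illustration. The index is hand-built for hypothetical documents d1 (“MIT Press”), d2 (“Mit”), and d3 (“mit press”), and the names `INDEX`, `cap_class`, and `search` are likewise hypothetical.

```python
import re

# Hand-built index with duplicative entries, keyed by (lowercased token, class).
INDEX = {
    ("mit", "none"): {"d1", "d2", "d3"},
    ("mit", "partial"): {"d1", "d2"},
    ("mit", "all"): {"d1"},
    ("press", "none"): {"d1", "d3"},
    ("press", "partial"): {"d1"},
}


def cap_class(token: str) -> str:
    """Step 430: classify a token as 'none', 'partial', or 'all' capitalization."""
    upper = sum(1 for ch in token if ch.isupper())
    if upper == 0:
        return "none"
    return "all" if upper == sum(1 for ch in token if ch.isalpha()) else "partial"


def search(query: str, index: dict[tuple[str, str], set[str]]) -> set[str]:
    """Steps 420-450: tokenize the query, select entries per token, combine."""
    result = None
    for token in re.findall(r"[A-Za-z]+", query):              # step 420
        docs = index.get((token.lower(), cap_class(token)), set())  # steps 430-440
        # Assumption: every token must be satisfied by the same document.
        result = docs if result is None else result & docs
    return result or set()                                     # step 450


print(sorted(search("mit press", INDEX)))  # ['d1', 'd3']
print(sorted(search("MIT Press", INDEX)))  # ['d1']
print(sorted(search("Mit", INDEX)))        # ['d1', 'd2']
```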

CONCLUSION

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments. As such, many modifications and variations will be apparent to practitioners skilled in this art. Accordingly, it is intended that the scope of the invention be defined by the following claims and their equivalents. Furthermore, it is contemplated that a particular feature described either individually or as part of an embodiment can be combined with other individually described features, or parts of other embodiments, even if the other features and embodiments make no mention of the particular feature. Thus, the absence of describing combinations should not preclude the inventor from claiming rights to such combinations.

1. A method for performing a text search, the method comprising: receiving one or more search terms; making a determination of a capitalization characteristic of at least a first of the one or more search terms; and identifying, from a plurality of documents, one or more documents based at least in part on the determination of the capitalization characteristic.
2. The method of claim 1, wherein making a determination of a capitalization characteristic includes at least one of (i) determining that no capitalization characteristic is present in the search term, and (ii) determining that the capitalization characteristic is present in the search term.
3. The method of claim 2, wherein determining that the capitalization characteristic is present in the search term includes determining a capitalization characteristic class from two or more possible capitalization characteristic classes.
4. The method of claim 3, wherein the capitalization class includes a first class in which each character of the at least first search term is capitalized.
5. The method of claim 1, wherein making a determination of a capitalization characteristic includes determining that all characters in the at least first search term are capitalized.
6. The method of claim 1, wherein making a determination of a capitalization characteristic includes determining that at least one character in the at least first search term is capitalized.
7. The method of claim 1, wherein making a determination of a capitalization characteristic includes identifying one or more individual characters in the at least first search term that are capitalized.
8. The method of claim 1, wherein identifying one or more documents based at least in part on a determination of the capitalization characteristic includes identifying the one or more documents based on a match of the at least first search term with a specified capitalization characteristic contained in the one or more identified documents.
9. The method of claim 1, wherein identifying one or more documents based at least in part on a determination of the capitalization characteristic includes identifying the one or more documents based on (i) the capitalization characteristic of the at least first search term being of a given capitalization class, and (ii) the one or more documents containing the at least first search term in the given capitalization class.
10. The method of claim 9, wherein the given class of the capitalization characteristic corresponds to a class selected from (i) a first class in which all characters in the at least first search term are capitalized, (ii) a second class in which at least one but not all characters in the at least first search term are capitalized, and (iii) a third class that includes the first class and the second class, and in which at least one character in the first search term is capitalized.
11. The method of claim 1, wherein making a determination of a capitalization characteristic includes determining that the at least first search term has the capitalization characteristic for a class corresponding to one of: (i) a first class corresponding to all characters in the at least first search term being capitalized, (ii) a second class corresponding to at least one but not all characters in the first search term being capitalized, and (iii) a third class corresponding to none of the characters in the first search term being capitalized.
12. The method of claim 11, wherein identifying one or more documents based at least in part on the determination of the capitalization characteristic includes: for when the at least first search term is of the first class, identifying the one or more documents includes identifying only those documents in the plurality of documents that contain the at least first search term with the capitalization characteristic of the first class; for when the at least first search term is of the second class, identifying the one or more documents includes identifying only those documents in the plurality of documents that contain the at least first search term with the capitalization characteristic of the first class or the second class; and for when the at least first search term is of the third class, identifying the one or more documents includes identifying the documents in the plurality of documents that contain the at least first search term with the capitalization characteristic of any of the first, second, or third classes.
13. The method of claim 1, wherein the method is performed by one or more processors that execute instructions, and wherein the plurality of documents are stored or cached locally on a computer-readable medium that is accessible to the one or more processors.
14. The method of claim 1, wherein the method is performed by one or more processors that execute instructions, and wherein the plurality of documents are located on a plurality of locations that are accessible to the one or more processors through a network.
15. A method for enabling a text search, the method comprising: for each document in a plurality of documents, identifying a plurality of text items, recording an entry in a data structure for each identified text item, and responsive to an identified text item having one or more characters that are capitalized, recording information in the data structure indicating a capitalization characteristic of the given text item.
16. The method of claim 15, further comprising enabling the data structure to be used for a search operation in which a search criteria specifies a capitalization characteristic or classification.
17. The method of claim 15, wherein recording information in the data structure corresponds to recording multiple entries for each text item that has one or more characters that are capitalized.
18. The method of claim 17, wherein recording multiple entries includes recording one of the multiple entries as being of a class with a corresponding capitalization characteristic for that class.
19. The method of claim 18, wherein recording multiple entries includes recording another of the multiple entries as having no capitalization characteristic.
20. The method of claim 15, wherein recording information in the data structure includes: responsive to a text item having no capitalization characteristic, storing one entry for the text item representing no capitalization characteristics in that text item; responsive to a text item having a capitalization characteristic in which at least one, but not all characters are capitalized, storing a first entry for the text item representing no capitalization characteristics in that text item, and a second entry representing that at least one, but not all characters are capitalized; and responsive to a text item having a capitalization characteristic in which all characters are capitalized, storing a first entry for the text item representing no capitalization characteristics in that text item, a second entry representing that at least one, but not all characters are capitalized, and a third entry representing that all characters of that entry are capitalized.
21. A system for performing a text search, the system comprising: an index that stores a plurality of entries, wherein each entry corresponds to a text item of a document, and wherein at least some of the entries represent information about a capitalization characteristic of a corresponding text item so as to enable a search operation to be performed that specifies a capitalization criteria.
22. The system of claim 21, further comprising a search interface to handle a request that includes the capitalization criteria.
23. The system of claim 21, further comprising a document retrieval component configured to access at least a first location to identify the plurality of entries from a plurality of documents.
24. The system of claim 23, wherein the document retrieval component is configured to access a plurality of network locations to identify the plurality of documents.
25. The system of claim 24, further comprising a capitalization characteristic determination module, wherein the characteristic determination module inspects the plurality of documents to determine a capitalization characteristic of individual text items in each of the plurality of documents.
26. The system of claim 25, wherein the capitalization characteristic determined by the capitalization characteristic module corresponds to determining whether an individual text item has no capitalization, partial capitalization, or all capitalization.
27. The system of claim 26, wherein for each text item that includes the capitalization characteristic of partial capitalization or all capitalization, multiple entries are included in the index to record the capitalization characteristic.
28. The system of claim 27, wherein for each text item that includes the capitalization characteristic of all capitalization, a first duplicative entry is stored in the index to represent a partial capitalization characteristic of the text item and a second duplicative entry is stored in the index to represent an all capitalization characteristic of the text item; and wherein for each text item that includes the capitalization characteristic of partial capitalization, only a first duplicative entry is stored in the index to represent the partial capitalization characteristic of the text item.
29. The system of claim 28, wherein the search interface performs the operation by: for when a search term specifies a criteria of the all capitalization characteristic, the search operation on the index identifies only one or more entries that identify one or more documents containing matching text items having the all capitalization characteristic of the search term; for when the search term specifies a criteria of the partial capitalization characteristic, the search operation on the index identifies only one or more entries that identify one or more documents containing matching text items having either the all capitalization characteristic or the partial capitalization characteristic; and for when the search term specifies a criteria with no capitalization characteristic, the search operation on the index identifies any entry that represents matching text items having any of the all capitalization characteristic, partial capitalization characteristic, or no capitalization characteristic.
30. A computer-readable medium carrying instructions for performing a text search, the instructions including instructions, that when executed by one or more processors, cause the one or more processors to perform steps comprising: receiving one or more search terms; making a determination of a capitalization characteristic of at least a first of the one or more search terms; and identifying, from a plurality of documents, one or more documents based at least in part on the determination of the capitalization characteristic.